Abstract

Most state-of-the-art named entity recognition (NER) systems rely on handcrafted features and on the output of other NLP tasks such as part-of-speech (POS) tagging and text chunking. In this work we propose a language-independent NER system that uses automatically learned features only. Our approach is based on the CharWNN deep neural network, which uses word-level and character-level representations (embeddings) to perform sequential classification. We perform an extensive number of experiments using two annotated corpora in two different languages: the HAREM I corpus, which contains texts in Portuguese; and the SPA CoNLL-2002 corpus, which contains texts in Spanish. Our experimental results shed light on the contribution of neural character embeddings for NER. Moreover, we demonstrate that the same neural network which has been successfully applied to POS tagging can also achieve state-of-the-art results for language-independent NER, using the same hyperparameters, and without any handcrafted features. For the HAREM I corpus, CharWNN outperforms the state-of-the-art system by 7.9 points in the F1-score for the total scenario (ten NE classes), and by 7.2 points in the F1 for the selective scenario (five NE classes).

1 Introduction 

Named entity recognition is a natural language processing (NLP) task that consists of finding names in a text and classifying them among several predefined categories of interest such as person, organization, location and time. Although machine learning based systems have been the predominant approach to achieve state-of-the-art results for NER, most of these NER systems rely on the use of costly handcrafted features and on the output of other NLP tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003; Doddington et al., 2004; Finkel et al., 2005; Milidiú et al., 2007). On the other hand, some recent work on NER has used deep learning strategies which minimize the need for these costly features (Chen et al., 2010; Collobert et al., 2011; Passos et al., 2014; Tang et al., 2014). However, as far as we know, there is still no work on deep learning approaches for NER that uses character-level embeddings.


In this paper we approach language-independent NER using CharWNN, a recently proposed deep neural network (DNN) architecture that jointly uses word-level and character-level embeddings to perform sequential classification (dos Santos and Zadrozny, 2014). CharWNN employs a convolutional layer that allows effective character-level feature extraction from words of any size. This approach has proven to be very effective for language-independent POS tagging (dos Santos and Zadrozny, 2014).


We perform an extensive number of experiments using two annotated corpora: the HAREM I corpus, which contains texts in Portuguese; and the SPA CoNLL-2002 corpus, which contains texts in Spanish. In our experiments, we compare the performance of the joint and individual use of character-level and word-level embeddings. We provide information on the impact of unsupervised pre-training of word embeddings on the performance of our proposed NER approach. Our experimental results provide evidence that CharWNN is effective and robust for Portuguese and Spanish NER. Using the same CharWNN configuration used by dos Santos and Zadrozny (2014) for POS tagging, we achieve state-of-the-art results for both corpora. For the HAREM I corpus, CharWNN outperforms the state-of-the-art system by 7.9 points in the F1-score for the total scenario (ten NE classes), and by 7.2 points in the F1 for the selective scenario (five NE classes). This is a remarkable result for a NER system that uses only automatically learned features.

This work is organized as follows. In Section 2, we briefly describe the CharWNN architecture. Section 3 details our experimental setup and Section 4 discusses our experimental results. Section 6 presents our final remarks.


2 CharWNN 


CharWNN extends Collobert et al.’s (2011) neural network architecture for sequential classification by adding a convolutional layer to extract character-level representations (dos Santos and Zadrozny, 2014). Given a sentence, the network gives for each word a score for each class (tag) $\tau \in T$. As depicted in Figure 1, in order to score a word, the network takes as input a fixed-sized window of words centralized in the target word. The input is passed through a sequence of layers where features with increasing levels of complexity are extracted. The output for the whole sentence is then processed using the Viterbi algorithm (Viterbi, 1967) to perform structured prediction. For a detailed description of the CharWNN neural network we refer the reader to (dos Santos and Zadrozny, 2014).


2.1 Word- and Character-level Embeddings 

As illustrated in Figure 1, the first layer of the network transforms words into real-valued feature vectors (embeddings). These embeddings are meant to capture morphological, syntactic and semantic information about the words. We use a fixed-sized word vocabulary $V^{wrd}$, and we consider that words are composed of characters from a fixed-sized character vocabulary $V^{chr}$. Given a sentence consisting of $N$ words $\{w_1, w_2, ..., w_N\}$, every word $w_n$ is converted into a vector $u_n = [r^{wrd}; r^{wch}]$, which is composed of two sub-vectors: the word-level embedding $r^{wrd} \in \mathbb{R}^{d^{wrd}}$ and the character-level embedding $r^{wch} \in \mathbb{R}^{cl_u}$ of $w_n$. While word-level embeddings capture syntactic and semantic information, character-level embeddings capture morphological and shape information.

Word-level embeddings are encoded by column vectors in an embedding matrix $W^{wrd} \in \mathbb{R}^{d^{wrd} \times |V^{wrd}|}$, and retrieving the embedding of a particular word consists in a simple matrix-vector multiplication. The matrix $W^{wrd}$ is a parameter to be learned, and the size of the word-level embedding $d^{wrd}$ is a hyperparameter to be set by the user.
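The matrix-vector view of embedding retrieval can be illustrated with a small sketch. The matrix values, the sizes and the helper names below are made up for the example and are not taken from the CharWNN implementation; the point is only that multiplying the embedding matrix by a one-hot vector selects one column.

```python
# Toy illustration: all sizes and values are hypothetical.
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]

def one_hot(index, size):
    """Vector with 1 at `index` and 0 in all other positions."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def lookup(matrix, index):
    """Read column `index` directly (the usual shortcut for the product)."""
    return [row[index] for row in matrix]

# Embedding matrix with 2 rows (embedding size) and 3 columns (vocabulary size):
# one column per vocabulary word.
W_wrd = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6]]

via_product = matvec(W_wrd, one_hot(2, 3))
via_lookup = lookup(W_wrd, 2)
print(via_product, via_lookup)
```

Both calls return the third column of the matrix, which is why implementations typically index the column instead of performing the full product.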

The character-level embedding of each word is computed using a convolutional layer (Waibel et al., 1989; Lecun et al., 1998). In Figure 1, we illustrate the construction of the character-level embedding for the word Bennett, but the same process is used to construct the character-level embedding of each word in the input. The convolutional layer first produces local features around each character of the word, and then combines them using a max operation to create a fixed-sized character-level embedding of the word.

Given a word $w$ composed of $M$ characters $\{c_1, c_2, ..., c_M\}$, we first transform each character $c_m$ into a character embedding $r^{chr}_m$. Character embeddings are encoded by column vectors in the embedding matrix $W^{chr} \in \mathbb{R}^{d^{chr} \times |V^{chr}|}$. Given a character $c$, its embedding $r^{chr}$ is obtained by the matrix-vector product $r^{chr} = W^{chr} v^c$, where $v^c$ is a vector of size $|V^{chr}|$ which has value 1 at index $c$ and zero in all other positions. The input for the convolutional layer is the sequence of character embeddings $\{r^{chr}_1, r^{chr}_2, ..., r^{chr}_M\}$.

The convolutional layer applies a matrix-vector operation to each window of $k^{chr}$ successive character embeddings in the sequence $\{r^{chr}_1, r^{chr}_2, ..., r^{chr}_M\}$. Let us define the vector $z_m \in \mathbb{R}^{d^{chr} k^{chr}}$ as the concatenation of the character embedding $m$, its $(k^{chr} - 1)/2$ left neighbors, and its $(k^{chr} - 1)/2$ right neighbors:

$$z_m = \left( r^{chr}_{m-(k^{chr}-1)/2}, ..., r^{chr}_{m+(k^{chr}-1)/2} \right)^T$$

The convolutional layer computes the $j$-th element of the vector $r^{wch}$, which is the character-level embedding of $w$, as follows:

$$[r^{wch}]_j = \max_{1 \le m \le M} \left[ W^0 z_m + b^0 \right]_j \quad (1)$$

where $W^0 \in \mathbb{R}^{cl_u \times d^{chr} k^{chr}}$ is the weight matrix of the convolutional layer. The same matrix is used to extract local features around each character window of the given word. Using the max over all character windows of the word, we extract a fixed-sized feature vector for the word.

Matrices $W^{chr}$ and $W^0$, and vector $b^0$ are parameters to be learned. The size of the character vector $d^{chr}$, the number of convolutional units $cl_u$ (which corresponds to the size of the character-level embedding of a word), and the size of the character context window $k^{chr}$ are hyperparameters.

Figure 1: CharWNN Architecture
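The max-over-windows convolution of Equation (1) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' Theano code: the function name, the toy weights and the use of a `pad` embedding for positions outside the word are assumptions made for the example.

```python
def char_level_embedding(char_embs, W0, b0, k_chr, pad):
    """Character-level embedding of one word, in the spirit of Equation (1).

    char_embs: list of M character embeddings (each a list of d_chr floats)
    W0: cl_u rows, each holding d_chr * k_chr weights; b0: cl_u biases
    pad: embedding used for window positions outside the word (assumption)
    """
    half = (k_chr - 1) // 2
    padded = [pad] * half + char_embs + [pad] * half
    best = None
    for m in range(len(char_embs)):
        # z_m: concatenation of the k_chr embeddings centered on character m
        z = [x for emb in padded[m:m + k_chr] for x in emb]
        local = [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(W0, b0)]
        # element-wise max over all windows seen so far
        best = local if best is None else [max(a, c) for a, c in zip(best, local)]
    return best

# Toy setting: d_chr = 1, k_chr = 3, cl_u = 2 (hypothetical values).
W0 = [[1.0, 1.0, 1.0],
      [1.0, 0.0, -1.0]]
b0 = [0.0, 0.0]
short_word = [[1.0], [2.0], [3.0]]
long_word = [[1.0], [2.0], [3.0], [0.5], [4.0]]
emb_short = char_level_embedding(short_word, W0, b0, k_chr=3, pad=[0.0])
emb_long = char_level_embedding(long_word, W0, b0, k_chr=3, pad=[0.0])
print(emb_short, emb_long)
```

Both words yield a vector with as many elements as there are convolutional units, regardless of the number of characters, which is the property that lets the layer handle words of any size.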


2.2 Scoring and Structured Inference 


We follow Collobert et al.’s (2011) window approach to score all tags $T$ for each word in a sentence. This approach follows the assumption that in sequential classification the tag of a word depends mainly on its neighboring words. Given a sentence with $N$ words $\{w_1, w_2, ..., w_N\}$, which have been converted to joint word-level and character-level embeddings $\{u_1, u_2, ..., u_N\}$, to compute tag scores for the $n$-th word $w_n$ in the sentence, we first create a vector $r$ resulting from the concatenation of a sequence of $k^{wrd}$ embeddings, centralized in the $n$-th word:

$$r = \left( u_{n-(k^{wrd}-1)/2}, ..., u_{n+(k^{wrd}-1)/2} \right)^T$$

We use a special padding token for the words with indices outside of the sentence boundaries.

Next, the vector $r$ is processed by two usual neural network layers, which extract one more level of representation and compute the scores:

$$s(w_n) = W^2 h(W^1 r + b^1) + b^2 \quad (2)$$

where matrices $W^1 \in \mathbb{R}^{hl_u \times k^{wrd}(d^{wrd} + cl_u)}$ and $W^2 \in \mathbb{R}^{|T| \times hl_u}$, and vectors $b^1 \in \mathbb{R}^{hl_u}$ and $b^2 \in \mathbb{R}^{|T|}$ are parameters to be learned. The transfer function $h(.)$ is the hyperbolic tangent. The size of the context window $k^{wrd}$ and the number of hidden units $hl_u$ are hyperparameters to be chosen by the user.
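As a rough illustration of the window approach and of Equation (2), the following sketch builds the window vector with padding and applies the two layers. All names, shapes and weight values are hypothetical and chosen only to keep the example small.

```python
import math

def window_vector(embs, n, k_wrd, pad):
    """Concatenate the k_wrd embeddings centered on position n (0-based).

    Positions outside the sentence use the `pad` embedding.
    """
    half = (k_wrd - 1) // 2
    r = []
    for i in range(n - half, n + half + 1):
        r.extend(embs[i] if 0 <= i < len(embs) else pad)
    return r

def affine(W, x, b):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def tag_scores(embs, n, k_wrd, pad, W1, b1, W2, b2):
    """Equation (2): s(w_n) = W2 tanh(W1 r + b1) + b2."""
    r = window_vector(embs, n, k_wrd, pad)
    hidden = [math.tanh(v) for v in affine(W1, r, b1)]
    return affine(W2, hidden, b2)

# Toy sentence of three words with one-dimensional joint embeddings.
embs = [[1.0], [2.0], [3.0]]
pad = [0.0]
W1, b1 = [[0.0, 0.0, 0.0]], [0.0]      # one hidden unit
W2, b2 = [[1.0], [2.0]], [0.5, -0.5]   # two tags
print(window_vector(embs, 0, 3, pad))
print(tag_scores(embs, 1, 3, pad, W1, b1, W2, b2))
```

With the all-zero first layer used here, the hidden activation is zero and the scores reduce to the output bias, which makes the result easy to check by hand.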

Like in (Collobert et al., 2011), CharWNN uses a prediction scheme that takes into account the sentence structure. The method uses a transition score $A_{tu}$ for jumping from tag $t \in T$ to $u \in T$ in successive words, and a score $A_{0t}$ for starting from the $t$-th tag. Given the sentence $[w]_1^N = \{w_1, w_2, ..., w_N\}$, the score for tag path $[t]_1^N = \{t_1, t_2, ..., t_N\}$ is computed as follows:

$$S\left([w]_1^N, [t]_1^N, \theta\right) = \sum_{n=1}^{N} \left( A_{t_{n-1} t_n} + s(w_n)_{t_n} \right) \quad (3)$$

where $s(w_n)_{t_n}$ is the score given for tag $t_n$ at word $w_n$ and $\theta$ is the set of all trainable network parameters $(W^{wrd}, W^{chr}, W^0, b^0, W^1, b^1, W^2, b^2, A)$.

After scoring each word in the sentence, the predicted sequence is inferred with the Viterbi algorithm.
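A compact sketch of Viterbi decoding over the path score of Equation (3) is given below. It is an illustration under simple assumptions (tags are integer indices, the start scores are passed as a separate list `A0`), not the authors' implementation.

```python
def viterbi(scores, A, A0):
    """Return the tag path that maximizes the path score, and that score.

    scores[n][t]: network score of tag t at word n
    A[t][u]: transition score from tag t to tag u
    A0[t]: score for starting in tag t
    """
    n_tags = len(A0)
    best = [A0[t] + scores[0][t] for t in range(n_tags)]
    back = []
    for n in range(1, len(scores)):
        prev, best, pointers = best, [], []
        for u in range(n_tags):
            # best previous tag for ending in tag u at word n
            t_best = max(range(n_tags), key=lambda t: prev[t] + A[t][u])
            pointers.append(t_best)
            best.append(prev[t_best] + A[t_best][u] + scores[n][u])
        back.append(pointers)
    tag = max(range(n_tags), key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1], max(best)

# Toy example with two tags; moving from tag 0 to tag 1 is heavily penalized.
scores = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
A = [[0.0, -10.0], [0.0, 0.0]]
A0 = [0.0, 0.0]
best_path, best_score = viterbi(scores, A, A0)
print(best_path, best_score)
```

Although tag 0 has the highest local score at the first word, the transition penalty makes the path that stays in tag 1 the best overall, which is the kind of interaction the structured inference is meant to capture.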

2.3 Network Training 

We train CharWNN by minimizing a negative likelihood over the training set $D$. In the same way as in (Collobert et al., 2011), we interpret the sentence score (3) as a conditional probability over a path. For this purpose, we exponentiate the score (3) and normalize it with respect to all possible paths. Taking the log, we arrive at the following conditional log-probability:

$$\log p\left([t]_1^N \mid [w]_1^N, \theta\right) = S\left([w]_1^N, [t]_1^N, \theta\right) - \log \left( \sum_{\forall [u]_1^N \in T^N} e^{S\left([w]_1^N, [u]_1^N, \theta\right)} \right) \quad (4)$$

The log-likelihood in Equation (4) can be computed efficiently using dynamic programming (Collobert, 2011). We use stochastic gradient descent (SGD) to minimize the negative log-likelihood with respect to $\theta$. We use the backpropagation algorithm to compute the gradients of the neural network. We implemented CharWNN using the Theano library (Bergstra et al., 2010).
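The dynamic programming mentioned above can be sketched as a forward recursion in log space. The code below is an illustrative reconstruction under the same toy conventions as before (integer tags, start scores in `A0`); it also includes a brute-force normalizer, usable only for tiny inputs, to check the recursion.

```python
import math
from itertools import product

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(scores, A, A0, path):
    """Path score of Equation (3) for one tag path."""
    total = A0[path[0]] + scores[0][path[0]]
    for n in range(1, len(path)):
        total += A[path[n - 1]][path[n]] + scores[n][path[n]]
    return total

def log_partition(scores, A, A0):
    """Log of the sum over all paths of exp(score), in O(N |T|^2) time."""
    n_tags = len(A0)
    alpha = [A0[t] + scores[0][t] for t in range(n_tags)]
    for n in range(1, len(scores)):
        alpha = [scores[n][u]
                 + log_sum_exp([alpha[t] + A[t][u] for t in range(n_tags)])
                 for u in range(n_tags)]
    return log_sum_exp(alpha)

def log_partition_brute_force(scores, A, A0):
    """Same quantity by enumerating every path (exponential cost)."""
    paths = product(range(len(A0)), repeat=len(scores))
    return log_sum_exp([path_score(scores, A, A0, p) for p in paths])

def log_prob(scores, A, A0, path):
    """Equation (4): path score minus the log-normalizer."""
    return path_score(scores, A, A0, path) - log_partition(scores, A, A0)

scores = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
A = [[0.0, -1.0], [0.5, 0.0]]
A0 = [0.0, 0.3]
print(log_prob(scores, A, A0, (0, 1, 1)))
```

The training loss for one sentence would then be the negative of `log_prob` evaluated at the gold tag path.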


3 Experimental Setup 

3.1 Unsupervised Learning of Word 
Embeddings 

The word embeddings used in our experiments are initialized by means of unsupervised pre-training. We perform pre-training of word-level embeddings using the skip-gram NN architecture (Mikolov et al., 2013) available in the word2vec¹ tool.

In our experiments on Portuguese NER, we use the word-level embeddings previously trained by dos Santos and Zadrozny (2014). They used a corpus composed of the Portuguese Wikipedia, the CETENFolha² corpus and the CETEMPublico³ corpus.

¹http://code.google.com/p/word2vec/
²http://www.linguateca.pt/cetenfolha/
³http://www.linguateca.pt/cetempublico/

In our experiments on Spanish NER, we use the Spanish Wikipedia. We process the Spanish Wikipedia corpus using the same steps used by dos Santos and Zadrozny (2014): (1) remove paragraphs that are not in Spanish; (2) substitute non-roman characters by a special character; (3) tokenize the text using a tokenizer that we have implemented; (4) remove sentences that are less than 20 characters long (including white spaces) or have less than 5 tokens; (5) lowercase all words and substitute each numerical digit by a 0. The resulting corpus contains around 450 million tokens.
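Steps (4) and (5) of the preprocessing pipeline are simple enough to sketch directly. The function names are hypothetical, and whitespace splitting is used only as a stand-in for the authors' tokenizer, which is not described in detail.

```python
import re

def keep_sentence(sentence, tokens, min_chars=20, min_tokens=5):
    """Step (4): keep a sentence only if it has at least 20 characters
    (white spaces included) and at least 5 tokens."""
    return len(sentence) >= min_chars and len(tokens) >= min_tokens

def normalize_token(token):
    """Step (5): lowercase the token and replace every digit by 0."""
    return re.sub(r"[0-9]", "0", token.lower())

sentence = "En 1992 Barcelona fue sede de los Juegos."
tokens = sentence.split()  # stand-in for the authors' own tokenizer
if keep_sentence(sentence, tokens):
    normalized = [normalize_token(t) for t in tokens]
    print(normalized)
```

Mapping every digit to 0 collapses all numbers with the same shape (for instance, all four-digit years) into a single vocabulary entry.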


Following (dos Santos and Zadrozny, 2014), we do not perform unsupervised learning of character-level embeddings. The character-level embeddings are initialized by randomly sampling each value from a uniform distribution $U(-r, r)$, where $r = \sqrt{\dfrac{6}{|V^{chr}| + d^{chr}}}$.
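A minimal sketch of this initialization follows. It assumes the bound is r = sqrt(6 / (|V^chr| + d^chr)); the vocabulary size of 100 is a made-up value, while the embedding size of 10 is the CharWNN setting reported in Table 2. The function name and the fixed seed are illustrative choices.

```python
import math
import random

def init_char_embeddings(vocab_size, d_chr, seed=0):
    """Sample a d_chr x vocab_size matrix from U(-r, r).

    Assumes r = sqrt(6 / (vocab_size + d_chr)).
    """
    rng = random.Random(seed)
    r = math.sqrt(6.0 / (vocab_size + d_chr))
    matrix = [[rng.uniform(-r, r) for _ in range(vocab_size)]
              for _ in range(d_chr)]
    return matrix, r

W_chr, r = init_char_embeddings(vocab_size=100, d_chr=10)
print(len(W_chr), len(W_chr[0]), round(r, 4))
```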


3.2 Corpora 

We use the corpus from the first HAREM evaluation (Santos and Cardoso, 2007) in our experiments on Portuguese NER. This corpus is annotated with ten named entity categories: Person (PESSOA), Organization (ORGANIZACAO), Location (LOCAL), Value (VALOR), Date (TEMPO), Abstraction (ABSTRACCAO), Title (OBRA), Event (ACONTECIMENTO), Thing (COISA) and Other (OUTRO). The HAREM corpus is already divided into two subsets: First HAREM and MiniHAREM. Each subset corresponds to a different Portuguese NER contest. In our experiments, we call HAREM I the setup where we use the First HAREM corpus as the training set and the MiniHAREM corpus as the test set. This is the same setup used by dos Santos and Milidiú (2012). Additionally, we tokenize the HAREM corpus and create a development set that comprises 5% of the training set. Table 1 presents some details of this dataset.

In our experiments on Spanish NER we use the SPA CoNLL-2002 Corpus, which was developed for the CoNLL-2002 shared task (Tjong Kim Sang, 2002). It is annotated with four named entity categories: Person, Organization, Location and Miscellaneous. The SPA CoNLL-2002 corpus is already divided into training, development and test sets. The development set has characteristics similar to the test corpora.

Table 1: Named Entity Recognition Corpora.

Corpus           Language     Training Sentences   Training Tokens   Test Sentences   Test Tokens
HAREM I          Portuguese   4,749                93,125            3,393            62,914
SPA CoNLL-2002   Spanish      8,323                264,715           1,517            51,533

We treat NER as a sequential classification problem. Hence, in both corpora we use the IOB2 tagging style where: O means that the word is not a NE; B-X is used for the leftmost word of a NE of type X; and I-X means that the word is inside a NE of type X. The IOB2 tagging style is illustrated in the following example.

Wolff/B-PER ,/O currently/O a/O
journalist/O in/O Argentina/B-LOC ,/O
played/O with/O Del/B-PER Bosque/I-PER
in/O the/O final/O years/O of/O the/O
seventies/O in/O Real/B-ORG
Madrid/I-ORG
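To make the IOB2 convention concrete, the sketch below converts a tagged sentence in the word/TAG notation of the example into a list of entities. The function name is hypothetical, and the handling of a stray I-X tag (one that does not continue an entity of the same type) is a lenient choice made for the example: such a tag simply closes the current entity.

```python
def iob2_to_spans(tagged_tokens):
    """Convert 'word/TAG' tokens in IOB2 style into (entity_text, type) pairs."""
    spans, words, etype = [], [], None
    for item in tagged_tokens:
        word, tag = item.rsplit("/", 1)
        if tag.startswith("I-") and etype == tag[2:]:
            words.append(word)  # continue the current entity
            continue
        if words:
            spans.append((" ".join(words), etype))  # close the current entity
        words, etype = [], None
        if tag.startswith("B-"):
            words, etype = [word], tag[2:]  # open a new entity
    if words:
        spans.append((" ".join(words), etype))
    return spans

example = ("Wolff/B-PER ,/O currently/O a/O journalist/O in/O Argentina/B-LOC ,/O "
           "played/O with/O Del/B-PER Bosque/I-PER in/O the/O final/O years/O "
           "of/O the/O seventies/O in/O Real/B-ORG Madrid/I-ORG")
entities = iob2_to_spans(example.split())
print(entities)
```

For the example sentence this recovers four entities: two persons, one location and one organization.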


3.3 Model Setup 


In most of our experiments, we use the same hyperparameters used by dos Santos and Zadrozny (2014) for part-of-speech tagging. The only exception is the learning rate for SPA CoNLL-2002, which we set to 0.005 in order to avoid divergence. The hyperparameter values are presented in Table 2. We use the development sets to determine the number of training epochs, which is six for HAREM and sixteen for SPA CoNLL-2002.

We compare CharWNN with two similar neural network architectures: CharNN and WNN. CharNN is equivalent to CharWNN without word embeddings, i.e., it uses character-level embeddings only. WNN is equivalent to CharWNN without character-level embeddings, i.e., it uses word embeddings only. Additionally, in the same way as in (Collobert et al., 2011), we check the impact of adding to WNN two handcrafted features that contain character-level information, namely capitalization and suffix. The capitalization feature has five possible values: all lowercased, first uppercased, all uppercased, contains an uppercased letter, and all other cases. We use suffixes of size three. In our experiments, both capitalization and suffix embeddings have dimension five. The hyperparameter values for these two NNs are shown in Table 2.
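The two handcrafted features added to WNN could be computed along the following lines. The text lists the five capitalization values but not how ties are resolved, so the order of the checks below (for instance, treating a single capital letter as "all uppercased") is an assumption, as are the function names.

```python
def capitalization_feature(word):
    """Map a word to one of the five capitalization values.

    The precedence of the checks is an assumption made for this sketch.
    """
    if word.islower():
        return "all lowercased"
    if word.isupper():
        return "all uppercased"
    if word[:1].isupper() and not any(c.isupper() for c in word[1:]):
        return "first uppercased"
    if any(c.isupper() for c in word):
        return "contains an uppercased letter"
    return "all other cases"

def suffix_feature(word, size=3):
    """Suffix of size three (the whole word if it is shorter)."""
    return word[-size:]

for w in ["madrid", "Madrid", "NER", "CoNLL", "2002"]:
    print(w, capitalization_feature(w), suffix_feature(w))
```

In the network, each feature value would then be mapped to its own embedding of dimension five, as stated in the text.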


4 Experimental Results 

4.1 Results for Spanish NER 

In Table 3, we report the performance of different NNs for the SPA CoNLL-2002 corpus. All results for this corpus were computed using the CoNLL-2002 evaluation script⁴. CharWNN achieves the best precision, recall and F1 in both development and test sets. For the test set, the F1 of CharWNN is 3 points larger than the F1 of the WNN that uses two additional handcrafted features: suffixes and capitalization. This result suggests that, for the NER task, the character-level embeddings are at least as effective as the two character-level features used in WNN. Similar results were obtained by dos Santos and Zadrozny (2014) in the POS tagging task.

⁴http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt

In the last two lines of Table 3, we can see the results of using word embeddings and character-level embeddings separately. Neither the WNN that uses word embeddings only nor CharNN achieves results competitive with those of the networks that jointly use word-level and character-level information. This is not surprising, since it is already known in the NLP community that jointly using word-level and character-level features is important to perform named entity recognition.

In Table 4, we compare CharWNN results with those of a state-of-the-art system for the SPA CoNLL-2002 Corpus. This system was trained using AdaBoost and is described in (Carreras et al., 2002). It employs decision trees as a base learner and uses handcrafted features as input. Among others, these features include gazetteers with people names and geographical location names. The AdaBoost based system divides the NER task into two intermediate sub-tasks: NE identification and NE classification. In the first sub-task, the system identifies NE candidates. In the second sub-task, the system classifies the identified candidates. In Table 4, we can see that even using only automatically learned features, CharWNN achieves state-of-the-art results for the SPA CoNLL-2002. This is an impressive result, since NER is a challenging task to perform without the use of gazetteers.


4.2 Results for Portuguese NER 

In Table 5, we report the performance of different NNs for the HAREM I corpus. The results in this table were computed using the CoNLL-2002 evaluation script. We report results in two scenarios: total and selective. In the total scenario, all ten categories are taken into account when scoring the systems. In the selective scenario, only five chosen categories (Person, Organization, Location, Date and Value) are taken into account. We can see in Table 5 that CharWNN and the WNN that uses two additional handcrafted features have similar results. We think that by increasing the training data, CharWNN has the potential to learn better character embeddings and outperform WNN, as happens in the SPA CoNLL-2002 corpus, which is larger than the HAREM I corpus. Again, CharNN and the WNN that uses word embeddings only do not achieve results competitive with those of the networks that jointly use word-level and character-level information.

Table 2: Neural Network Hyperparameters.

Parameter   Parameter Name               CharWNN   WNN      CharNN
d^wrd       Word embedding dimensions    100       100      -
k^wrd       Word context window size     5         5        5
d^chr       Char. embedding dimensions   10        -        50
k^chr       Char. context window size    5         -        5
cl_u        Convolutional units          50        -        200
hl_u        Hidden units                 300       300      300
λ           Learning rate                0.0075    0.0075   0.0075

Table 3: Comparison of different NNs for the SPA CoNLL-2002 corpus.

NN        Features                    Dev. Prec.   Dev. Rec.   Dev. F1   Test Prec.   Test Rec.   Test F1
CharWNN   word emb., char emb.        80.13        78.68       79.40     82.21        82.21       82.21
WNN       word emb., suffix, capit.   78.33        76.31       77.30     79.64        78.67       79.15
WNN       word embeddings             73.87        68.45       71.06     73.77        68.19       70.87
CharNN    char embeddings             53.86        51.40       52.60     61.13        59.03       60.06

Table 4: Comparison with the state-of-the-art for the SPA CoNLL-2002 corpus.

System     Features                                                  Prec.   Rec.    F1
CharWNN    word embeddings, char embeddings                          82.21   82.21   82.21
AdaBoost   words, orthographic, POS tags, trigger words,             81.38   81.40   81.39
           bag-of-words, gazetteers, word suffixes,
           word type patterns, entity length

In order to compare CharWNN results with those of the state-of-the-art system, we report in Tables 6 and 7 the precision, recall, and F1 scores computed with the evaluation scripts from the HAREM I competition⁵ (Santos and Cardoso, 2007), which uses a scoring strategy different from the CoNLL-2002 evaluation script.

⁵http://www.linguateca.pt/primeiroHAREM/harem_Arquitectura.html

In Table 6, we compare CharWNN results with those of ETL_CMT, a state-of-the-art system for the HAREM I Corpus (dos Santos and Milidiú, 2012). ETL_CMT is an ensemble method that uses Entropy Guided Transformation Learning (ETL) as the base learner. The ETL_CMT system uses handcrafted features like gazetteers and dictionaries as well as the output of other NLP tasks such as POS tagging and noun phrase (NP) chunking. As we can see in Table 6, CharWNN outperforms the state-of-the-art system by a large margin in both total and selective scenarios, which is a remarkable result for a system that uses automatically learned features only.

In Table 7, we compare CharWNN results by entity type with those of ETL_CMT. These results were computed in the selective scenario. CharWNN produces a much better recall than ETL_CMT for the classes LOC, PER and ORG. For the ORG entity, the improvement is of 21 points in the recall. We believe that a large part of this boost in the recall is due to the unsupervised pre-training of word embeddings, which can leverage large amounts of unlabeled data to produce reliable word representations.

Table 5: Comparison of different NNs for the HAREM I corpus.

NN        Features                    Total Prec.   Total Rec.   Total F1   Selective Prec.   Selective Rec.   Selective F1
CharWNN   word emb., char emb.        67.16         63.74        65.41      73.98             68.68            71.23
WNN       word emb., suffix, capit.   68.52         63.16        65.73      75.05             68.35            71.54
WNN       word embeddings             63.32         53.23        57.84      68.91             58.77            63.44
CharNN    char embeddings             57.10         50.65        53.68      66.30             54.54            59.85

Table 6: Comparison with the state-of-the-art for the HAREM I corpus.

System    Features                         Total Prec.   Total Rec.   Total F1   Selective Prec.   Selective Rec.   Selective F1
CharWNN   word emb., char emb.             74.54         68.53        71.41      78.38             77.49            77.93
ETL_CMT   words, POS tags, NP tags,        77.52         53.86        63.56      77.27             65.20            70.72
          capitalization, word length,
          dictionaries, gazetteers

4.3 Impact of unsupervised pre-training of 
word embeddings 

In Table 8, we assess the impact of unsupervised pre-training of word embeddings on CharWNN performance for both SPA CoNLL-2002 and HAREM I (selective). The results were computed using the CoNLL-2002 evaluation script. For both corpora, CharWNN results are improved when using unsupervised pre-training. The impact of unsupervised pre-training is larger for the HAREM I corpus (13.2 points in the F1) than for the SPA CoNLL-2002 (4.3 points in the F1). We believe one of the main reasons for this difference in impact is the training set size, which is much smaller in the HAREM I corpus.
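As a quick sanity check on these numbers, the sketch below recomputes F1 as the harmonic mean of precision and recall, using the precision and recall values from Table 8, and derives the gains quoted in the text. The function name and the dictionary layout are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs with and without pre-trained word embeddings.
table8 = {
    "SPA CoNLL-2002": {"yes": (82.21, 82.21), "no": (78.21, 77.63)},
    "HAREM I": {"yes": (73.98, 68.68), "no": (65.21, 52.27)},
}

gains = {}
for corpus, rows in table8.items():
    f1_yes = f1_score(*rows["yes"])
    f1_no = f1_score(*rows["no"])
    gains[corpus] = round(f1_yes - f1_no, 1)
    print(corpus, round(f1_yes, 2), round(f1_no, 2), gains[corpus])
```

The recomputed F1 values agree with the reported ones to two decimal places, and the differences match the gains of 4.3 and 13.2 points stated above.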

5 Related Work 


Some recent work on deep learning for named entity recognition includes Chen et al. (2010), Collobert et al. (2011) and Passos et al. (2014).

Chen et al. (2010) employ deep belief networks (DBN) to perform named entity categorization. In their system, they assume that the boundaries of all the entity mentions were previously identified, which makes their task easier than the one we tackle in this paper. The input for their model is the character-level information of the entity to be classified. They apply their system to a Chinese corpus and achieve state-of-the-art results for the NE categorization task.


Collobert et al. (2011) propose a deep neural network which is equivalent to the WNN architecture described in Section 3.3. They achieve state-of-the-art results for English NER by adding a feature based on gazetteer information.


Passos et al. (2014) extend the Skip-Gram language model (Mikolov et al., 2013) to produce phrase embeddings that are more suitable to be used in a linear-chain CRF to perform NER. Their linear-chain CRF, which also uses additional handcrafted features such as gazetteer-based features, achieves state-of-the-art results on two English corpora: CoNLL 2003 and Ontonotes NER.

The main difference between our approach and the ones proposed in previous work is the use of neural character embeddings. This type of embedding allows us to achieve state-of-the-art results for the full task of identifying and classifying named entities using only automatically learned features. Additionally, we perform experiments with two different languages, while previous work focused on one language.


6 Conclusions 


In this work we approach language-independent NER using a DNN that employs word- and character-level embeddings to perform sequential classification. We demonstrate that the same DNN which was successfully applied to POS tagging can also achieve state-of-the-art results for NER, using the same hyperparameters, and without any handcrafted features. Moreover, we shed some light on the contribution of neural character embeddings for NER, and define new state-of-the-art results for Portuguese and Spanish NER.

Table 7: Results by entity type for the HAREM I corpus.

Entity    CharWNN Prec.   CharWNN Rec.   CharWNN F1   ETL_CMT Prec.   ETL_CMT Rec.   ETL_CMT F1
DATE      90.27           81.32          85.56        88.29           82.21          85.14
LOC       76.91           78.55          77.72        76.18           68.16          71.95
ORG       70.65           71.56          71.10        65.34           50.29          56.84
PER       81.35           77.07          79.15        81.49           61.14          69.87
VALUE     78.08           74.99          76.51        77.72           70.13          73.73
Overall   78.38           77.49          77.93        77.27           65.20          70.72

Table 8: Impact of unsup. pre-training of word emb. in CharWNN performance.

Corpus           Pre-trained word emb.   Precision   Recall   F1
SPA CoNLL-2002   Yes                     82.21       82.21    82.21
SPA CoNLL-2002   No                      78.21       77.63    77.92
HAREM I          Yes                     73.98       68.68    71.23
HAREM I          No                      65.21       52.27    58.03